mango: fix $beginsWith range #4828

Merged Nov 3, 2023 (1 commit)

willholley (Member) commented Nov 2, 2023

Overview

In the initial implementation of $beginsWith, the range calculation for view indexes mistakenly appends an 8-bit integer, which is truncated to FF, rather than building a binary with an extra 3 bytes at the end.

This PR fixes mango_idx_view:range/5 by correctly appending the U+FFFF code point to create a UTF-8-encoded binary. Additionally, the Erlang utf8 binary type ensures the result is a valid UTF-8 string. If Arg is not a utf8 binary, this will throw a badarg error.

U+FFFF is used instead of U+10FFFF as it's the highest sorting code point according to the collator rules.
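
The difference between the two constructions can be illustrated with a minimal sketch. The module and function names below are hypothetical and are not taken from the CouchDB source; the sketch only relies on standard Erlang bit syntax, and unlike the PR it does not validate the prefix.

```erlang
-module(beginswith_range_sketch).
-export([buggy_upper/1, fixed_upper/1]).

%% Hypothetical helper: an integer segment defaults to 8 bits, so
%% 16#FFFF is truncated to 16#FF. Only one byte is appended, and the
%% resulting binary is not valid UTF-8 (0xFF never occurs in UTF-8).
buggy_upper(Prefix) when is_binary(Prefix) ->
    <<Prefix/binary, 16#FFFF>>.

%% Hypothetical helper: the utf8 segment type encodes the code point
%% U+FFFF as the three bytes EF BF BF.
fixed_upper(Prefix) when is_binary(Prefix) ->
    <<Prefix/binary, 16#FFFF/utf8>>.
```

For the prefix `<<"abc">>`, `buggy_upper/1` yields `<<97,98,99,255>>`, while `fixed_upper/1` yields `<<97,98,99,239,191,191>>`.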

We expect Arg strings to be valid UTF-8 but, to be safe, mango_selector:norm_ops/1 is enhanced to verify that any argument to $beginsWith is a UTF-8 string.

Testing recommendations

make mango-test

Related Issues or Pull Requests

Initial implementation: #4810

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • Documentation changes were made in the src/docs folder
  • Documentation changes were backported (separated PR) to affected branches

pgj (Contributor) left a comment

Please add more unit test cases to cover strings of invalid UTF-8 data.

src/mango/src/mango_idx_view.erl (review thread resolved)

willholley (Member, Author) commented

@pgj I'm trying to think where it would make sense to add unit tests for non-UTF8 data. There are no existing eunit tests for field_ranges or range (perhaps we add some?) and I don't think it's possible to generate non-UTF8 data via the integration tests.

pgj (Contributor) commented Nov 2, 2023

If you implement the pre-check that I suggested, then mango_idx_view:range/5 can rely on its input always being valid UTF-8, so no tests are needed there. The test requirement could then be pushed to the validation phase.

When I experimented with this fix locally, I extended mango_selector:norm_ops/1 as follows:

diff --git a/src/mango/src/mango_selector.erl b/src/mango/src/mango_selector.erl
index 93d3b10ca..24660e963 100644
--- a/src/mango/src/mango_selector.erl
+++ b/src/mango/src/mango_selector.erl
@@ -136,7 +136,10 @@ norm_ops({[{<<"$text">>, Arg}]}) when
 norm_ops({[{<<"$text">>, Arg}]}) ->
     ?MANGO_ERROR({bad_arg, '$text', Arg});
 norm_ops({[{<<"$beginsWith">>, Arg}]} = Cond) when is_binary(Arg) ->
-    Cond;
+    case couch_util:validate_utf8(Arg) of
+        true -> Cond;
+        false -> ?MANGO_ERROR({bad_arg, '$beginsWith', Arg})
+    end;
 % Not technically an operator but we pass it through here
 % so that this function accepts its own output. This exists
 % so that $text can have a field name value which simplifies

With that, the existing definition of match_beginswith_test/0 in the same module could be extended naturally.
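
To experiment with this kind of check outside a CouchDB checkout, it can be approximated with the Erlang standard library alone. The sketch below is a stand-in under that assumption, not `couch_util:validate_utf8/1` itself; the module and function names are hypothetical.

```erlang
-module(utf8_check_sketch).
-export([is_valid_utf8/1]).

%% Hypothetical helper. unicode:characters_to_binary/3 returns a binary
%% when the input is well-formed UTF-8, and an {error, ...} or
%% {incomplete, ...} tuple otherwise.
is_valid_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Out when is_binary(Out) -> true;
        _ -> false
    end.
```

Under this sketch, `is_valid_utf8(<<"abc">>)` returns `true`, while `is_valid_utf8(<<"abc", 255>>)` returns `false`, which is the kind of input a validation-phase test for $beginsWith would need to cover.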

@willholley willholley marked this pull request as ready for review November 2, 2023 11:32
@willholley willholley force-pushed the mango-beginswith-fixes branch 2 times, most recently from caf1e70 to b2ebc29 Compare November 2, 2023 14:17
willholley (Member, Author) commented

@pgj updated as per #4829

% invalid (prefix is not a utf8 string)
?assertThrow(
{mango_error, mango_selector, {invalid_operator, <<"$beginsWith">>}},
check_beginswith(<<123>>, 1)

Contributor:

<<123>> is the field name here, not the prefix. Is this a mistake...?

Member Author:

my mistake - corrected

@willholley willholley force-pushed the mango-beginswith-fixes branch 2 times, most recently from cc4fa13 to 0142ed8 Compare November 2, 2023 18:34
nickva (Contributor) left a comment

+1

nickva (Contributor) commented Nov 2, 2023

In the initial implementation of $beginsWith, the range calculation
for json indexes mistakenly appends an integer with the size of
8 bits which gets maxed out at FF, rather than building a binary
with an extra 3 bytes at the end.

This commit fixes the `mango_idx_view:range/5` by correctly appending
the `U+FFFF` code point to create a utf-8 encoded binary. Additionally,
the Erlang `utf8` binary type ensures the result
is a valid utf8 string. If `Arg` is not a utf8 binary, this will
throw a badarg error.

We expect `Arg` strings to be a valid utf8 but, to be safe,
`mango_selector:norm_ops/1` is enhanced to verify
that any argument to `$beginsWith` is a utf8 string.
@willholley willholley merged commit 359cc38 into main Nov 3, 2023
14 checks passed
@willholley willholley deleted the mango-beginswith-fixes branch November 3, 2023 11:07